{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "8bc47910-45c3-433e-a865-a6ef4a7af366",
   "metadata": {},
   "source": [
    "## A02_Data Formatting Process\n",
    "\n",
    "This code will convert the raw data into a formatted DataFrame and store it into a formatted Parquet file."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cf3f9e86-9aaf-4886-8b53-19895cee8529",
   "metadata": {},
   "source": [
    "### 1. init spark session"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "f4d85c4c-9c2c-4975-8c51-4837169df20e",
   "metadata": {},
   "outputs": [],
   "source": [
    "import findspark\n",
    "findspark.init()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "627b1415-b963-4ecf-b6f9-be9844eb56c1",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyspark.sql import SparkSession\n",
    "spark = SparkSession.builder.appName(\"income_data\").getOrCreate()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9e2954b3-47c7-40f4-bc98-4e61c1e83229",
   "metadata": {},
   "source": [
    "### 2. read the json data files"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "d08a99aa-a77c-4bd5-81d8-1c4acdefc793",
   "metadata": {},
   "outputs": [],
   "source": [
    "data_dir = \"./Landing Zone/income\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "d16b4b9a-8386-41e2-9a91-20449eb1e52c",
   "metadata": {},
   "outputs": [],
   "source": [
    "df = spark.read.csv(data_dir, header=True, inferSchema=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "eabf3844-4a3f-495e-af14-475ed518b89b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+----+--------------+--------------+----------+--------------------+--------+-------------------------+\n",
      "| Any|Codi_Districte| Nom_Districte|Codi_Barri|           Nom_Barri|Població|Índex RFD Barcelona = 100|\n",
      "+----+--------------+--------------+----------+--------------------+--------+-------------------------+\n",
      "|2007|             1|  Ciutat Vella|         1|            el Raval|   46595|                     64.7|\n",
      "|2007|             1|  Ciutat Vella|         2|      el Barri Gòtic|   27946|                     86.5|\n",
      "|2007|             1|  Ciutat Vella|         3|      la Barceloneta|   15921|                     66.7|\n",
      "|2007|             1|  Ciutat Vella|         4|Sant Pere, Santa ...|   22572|                     80.2|\n",
      "|2007|             2|      Eixample|         5|       el Fort Pienc|   31521|                    107.9|\n",
      "|2007|             2|      Eixample|         6|  la Sagrada Família|   52185|                    101.8|\n",
      "|2007|             2|      Eixample|         7|la Dreta de l'Eix...|   42504|                    137.6|\n",
      "|2007|             2|      Eixample|         8|l'Antiga Esquerra...|   41413|                    126.5|\n",
      "|2007|             2|      Eixample|         9|la Nova Esquerra ...|   58146|                    116.9|\n",
      "|2007|             2|      Eixample|        10|         Sant Antoni|   37988|                    103.8|\n",
      "|2007|             3|Sants-Montjuïc|        11|        el Poble Sec|   39579|                     73.3|\n",
      "|2007|             3|Sants-Montjuïc|        12|la Marina del Pra...|    1005|                     80.4|\n",
      "|2007|             3|Sants-Montjuïc|        13|   la Marina de Port|   29327|                     80.2|\n",
      "|2007|             3|Sants-Montjuïc|        14|la Font de la Gua...|   10064|                     90.4|\n",
      "|2007|             3|Sants-Montjuïc|        15|         Hostafrancs|   15771|                     82.7|\n",
      "|2007|             3|Sants-Montjuïc|        16|          la Bordeta|   18592|                     81.9|\n",
      "|2007|             3|Sants-Montjuïc|        17|       Sants - Badal|   24085|                     85.9|\n",
      "|2007|             3|Sants-Montjuïc|        18|               Sants|   40272|                     89.5|\n",
      "|2007|             4|     Les Corts|        19|           les Corts|   46400|                    130.4|\n",
      "|2007|             4|     Les Corts|        20|la Maternitat i S...|   23938|                    127.9|\n",
      "+----+--------------+--------------+----------+--------------------+--------+-------------------------+\n",
      "only showing top 20 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "df.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "437ba981-5e9f-4acb-951f-a4c9bee275bd",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "root\n",
      " |-- Any: integer (nullable = true)\n",
      " |-- Codi_Districte: integer (nullable = true)\n",
      " |-- Nom_Districte: string (nullable = true)\n",
      " |-- Codi_Barri: integer (nullable = true)\n",
      " |-- Nom_Barri: string (nullable = true)\n",
      " |-- Població: integer (nullable = true)\n",
      " |-- Índex RFD Barcelona = 100: string (nullable = true)\n",
      "\n"
     ]
    }
   ],
   "source": [
    "df.printSchema()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "67533110-492c-4d8b-9a42-3ee845e54382",
   "metadata": {},
   "source": [
    "### 3. partition datas by district_id field for efficient query"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "9182e9eb-bc02-40a3-9c2b-a487705796d1",
   "metadata": {},
   "outputs": [],
   "source": [
    "repartitioned_df = df.repartition(\"Codi_Districte\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4b8c2ec0-7f5e-4875-9443-36329271c213",
   "metadata": {},
   "source": [
    "### 4. save to parquet file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "682a1eec-e450-45d0-a49c-295ef9a0f515",
   "metadata": {},
   "outputs": [],
   "source": [
    "output_path = \"./Formatted Zone/income_data.parquet\"\n",
    "\n",
    "repartitioned_df.write.mode(\"overwrite\").parquet(output_path)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2adb3b40-a02b-4e78-92e7-354090f7a32e",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.19"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
